In this vignette, we will introduce how to use the R package LinCDE for conditional density estimation.

1 Background

The density of a continuous random variable characterizes its relative likelihood of taking a specific value. In statistics, many questions are essentially questions of density characteristics, e.g., location, variance, modality, and skewness. A fruitful collection of methods have been proposed for density estimation, such as kernel density estimation, Lindsey’s method.

In practice, the density of interest is often heterogeneous across observations. For instance, the height goes up in the mean from children to adults, the density of food expenditure is more variant among the higher-income community relative to the lower-income community, and the density of salary is bi-modal among lawyers while unimodal among firefighters. The heterogeneity in conditional densities often carries meaningful messages.

We propose LinCDE boost to estimate the conditional densities: a boosting algorithm based on LinCDE trees. A LinCDE tree partitions the covariate space into subregions with approximately locally homogeneous densities, employs Lindsey’s method for density estimation in the subregions, and aggregates the densities from different sub-areas as the estimated conditional density. LinCDE boost grows a number of LinCDE trees in a stagewise forward manner, and each tree fits the residuals of the previous estimate. For more details, please refer to the LinCDE paper.

2 LinCDE package

2.1 A toy example

We illustrate the workflow of the LinCDE package using a toy example. Before all, let us install and attach the LinCDE package.

library("LinCDE")

Next, we prepare the example dataset. The LinCDE boost estimator requires a covariate matrix and a response vector. In this example, we generate \(20\) covariates uniformly from \([-1,1]\). Given a covariate value, the response is locally Gaussian. The response depends on the covariates through its mean and variance. In particular, \(X_1\) and \(X_2\) influence the response’s mean and \(X_2\) also influences the response’s variance. The lattice plot below visualizes the conditional densities at \(9\) different feature points.

# locally Gaussian design (LGD)
density.LGD = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = (0.5 * X[, 1] + X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
  }else{
    y = 0.5 * X[, 1] + X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[ ,2])
  }
}
# feature points for visualization
d = 20; XProfile = matrix(0, nrow = 3^2, ncol = d); colnames(XProfile) = paste("X", seq(1,d), sep = "")
X1 = X2 = c(-0.6,0,0.6); XProfile[, c(1,2)] = as.matrix(expand.grid(X1, X2))
# true conditional density plots
densityPlot(X = XProfile, trueDensity = density.LGD, minY = -3, maxY = 3, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

We conduct conditional density estimation based on \(1000\) training data points. We also generate independent validation and test samples of size \(1000\) for hyper-parameter tuning and performance evaluation, respectively.

set.seed(100)
# training data
n = 1000; X = matrix(runif(n * d, -1, 1), ncol = d); colnames(X) = paste("X", seq(1,d), sep = "")
y.LGD = density.LGD(X)
# validation data for tuning
nVal = 1000; XVal = matrix(runif(nVal * d, -1, 1), ncol = d); colnames(XVal) = paste("X", seq(1,d), sep = "")
yVal.LGD = density.LGD(XVal)
# test data
nTest = 1000; XTest = matrix(runif(nTest * d, -1, 1), ncol = d); colnames(XTest) = paste("X", seq(1,d), sep = "")
yTest.LGD = density.LGD(XTest)

2.1.1 Fitting

There are a few hyper-parameters critical to the performance of LinCDE boost. The two primary parameters of interest are the number of iterations (n.trees) and the depths of LinCDE trees (depth).

  • n.trees: the number of LinCDE trees to fit. We train a LinCDE boost model on the training data with a large number of trees, e.g., 100 trees. We then compute the validation log-likelihoods after each iteration on a separate validation dataset. We choose the number of iterations corresponding to the maximal validation log-likelihood.

    In the following example with depth = 2, we specify the maximal number of trees as \(100\). We evaluate the validation log-likelihood and plot the curve. The log-likelihood first increases then stabilizes at \(-0.76\) after \(50\) iterations. Therefore, we choose n.trees = 50 for depth = 2.

  • depth: the number of splits of each LinCDE tree. We apply LinCDE boost with a grid of depths. For each depth, we choose the number of iterations as above and record the associated validation log-likelihood. We choose the depth with the maximal validation log-likelihood.

    In the following example, we experiment with depth = 1 and depth = 2. For depth = 1, we choose 100 iterations, and the associated log-likelihood is \(-0.80\). For depth = 2, we choose 50 iterations, and the associated log-likelihood is \(-0.76\). Since \(-0.76 > -0.80\), we choose depth = 2. Notice that the example’s conditional mean involves an interaction term of \(X_1\) and \(X_2\), and thus depth = 1 — an additive model in the density’s exponent — is too restrictive compared to depth = 2 — a richer model including first-order interactions.

    In standard boosting, deep trees are problematic due to overfitting. In LinCDE boost, the overfitting issue is more severe than standard boosting because the density estimation problem at terminal nodes is more complicated than regression and classification. As a result, we do not recommend depth > 5.

model.LGD.D1 = LinCDE.boost(X = X, y = y.LGD, depth = 1, n.trees = 100, verbose = TRUE)
predict.LGD.D1 = predict(model.LGD.D1, X = XVal, y = yVal.LGD, densityOnly = FALSE)

model.LGD.D2 = LinCDE.boost(y = y.LGD, X = X, depth = 2, n.trees = 100, verbose = TRUE) 
predict.LGD.D2 = predict(model.LGD.D2, X = XVal, y = yVal.LGD, densityOnly = FALSE)

There are a number of secondary hyper-parameters. We recommend starting with the default values and making changes only if certain issues are observed.

  • basis: the type of spline basis. Default is the cubic natural spline basis. Cubic natural splines are desirable for their flexibility, smoothness, and linear extensions beyond the boundaries. However, if the conditional densities are believed to belong to a certain distribution family, then specific basis should be adopted. Here are two examples.

    • In the above example where the response is locally Gaussian, basis = “Gaussian” is the most appropriate choice.
    • For the modality and skewness examples below, the response can be locally bi-modal. basis = “Gaussian” can not produce bi-modality structures and a more flexible spline basis should be adopted.
  • splineDf: the number of spline basis. Default is \(10\). If you go with the natural cubic spline basis, then splineDf specifies the splines’ degrees of freedom, i.e., the number of spline bases. A larger splineDf is able to characterize more local structures but may produce unnecessary curvatures.

  • df: the ridge Poisson regression’s degrees of freedom. Default is \(2\). df is used for determining the ridge regularization hyper-parameter. A smaller df corresponds to a larger regularization parameter, and assists to avoid computational instabilities at subregions with a limited number of observations.

  • prior: type of the initial carrying density. Default is “Gaussian”, i.e., the Gaussian distribution with the marginal response mean and the standard deviation is used as the universal density initialization. If you set prior = “uniform”, the uniform distribution over the response range is used. If you set prior = “LindseyMarginal”, the marginal response density estimated by Lindsey’s method based on all responses is used. You can also input a homogeneous or heterogeneous conditional density function. The conditional density function should take a covariate matrix \(X\), a response vector \(y\), and output the densities at pairs \((X, y)\). If the prior conditional density is close to the underlying truth, e.g., a pretrained conditional density estimator, LinCDE boost will require less iterations.

In below, the prior distribution is similar to the true conditional density. It takes LinCDE boost about \(15\) iterations to converge, much faster compared to the \(50\) iterations with the Gaussian prior.

# input a heterogeneous conditional density function as prior
density.LGD.prior = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = (0.25 * X[, 1] + 0.5 * X[, 1] * X[, 2]), sd = (0.5 + 0.25 * X[, 2]))
  }else{
    y = 0.25 * X[, 1] + 0.5 * X[, 1] * X[, 2] + rnorm(dim(X)[1], 0, 1) * (0.5 + 0.25 * X[ ,2])
  }
}

model.LGD.prior = LinCDE.boost(X = X, y = y.LGD, depth = 2, df = 2, n.trees = 100, prior = density.LGD.prior, verbose = T)
predict.LGD.prior = predict(model.LGD.prior, X = XVal, y = yVal.LGD, densityOnly = FALSE)

  • Standard boosting parameters:
    • splitPoint: a list of candidate splits or numbers of candidate split. Each element is a vector corresponding to a variable’s candidate splits (including the left and right endpoints). The list’s elements are ordered the same as \(X\)’s columns. An alternative input is candidate split numbers, a scalar if all variables share the same number of candidate splits, a vector of length nvars if variables have different numbers of candidate splits. If candidate split numbers are given, each variable’s range is divided into splitPoint-1 intervals, i.e., splitPoint knots, containing approximately the same number of observations. Default is \(20\). Note that if a variable has fewer unique values than the desired number of intervals, split intervals corresponding to each unique value are created.
    • shrinkage: the shrinkage parameter applied to each tree in the expansion, value in \((0,1]\). Default is \(0.1\). A smaller shrinkage leads to less overfitting but slower convergence.
    • terminalSize: the minimum number of observations in a terminal node. Default is \(20\). We do not recommend a very small terminalSize, since fitting a density model at each terminal node may involve quite a few parameters and a reasonable number of samples are needed.
  • Centering parameters:
    • centering: If true, a conditional mean model is fitted first, and LinCDE boost is applied to the residuals. The centering is recommended for responses whose conditional support varies wildly. See below for an example. Default is false.
    • centeringMethod: a conditional mean estimator. Applies only when centering is true. If centeringMethod = “linearRegression”, a regression model is fitted to the response. If centeringMethod = “randomForest”, a random forest model is fitted. Default is fitting a random forest. Applies only to centering = TRUE. If centeringMethod is a function, the call centeringMethod(y, X) should return a conditional mean model with a predict function. Default is “randomForest”.

In the above example, we set hyperparameters splineDf = “Gaussian”, shrinkage = 0.02, depth = 3, n.trees = 300, and leave other parameters at defualt values.

model.LGD.final = LinCDE.boost(y = y.LGD, X = X, basis = "Gaussian", depth = 3, n.trees = 300, shrinkage = 0.02)

2.1.2 Prediction

With a LinCDE boost model, we can predict conditional densities of an independent test dataset via function predict and evaluate LinCDE boost’ performance. In the paper, we recommend the assessment metric: the relative improvement in the test log-likelihood \[\begin{align*} \frac{\ell_{\text{LinCDE boost}} - \ell_{\text{null}}}{\ell_{\text{oracle}} - \ell_{\text{null}}}, \end{align*}\] where the null model is the universal Gaussian distribution with the response’s marginal mean and standard deviation, and the oracle denotes the true underlying conditional density. The criterion is analogous to the goodness-of-fit measure \(R^2\) of linear regression. In this example, LinCDE boost achieves \(80.9\%\) of the oracle’s improvement over the null model.

# LinCDE boost
predict.LGD.final = predict(model.LGD.final, X = XTest, y = yTest.LGD, densityOnly = FALSE)

# null model
model.LGD.null = LinCDE.boost(X = X, y = y.LGD, basis = "Gaussian", n.trees = 0)
predict.LGD.null = predict(model.LGD.null, X = XTest, y = yTest.LGD,  densityOnly = FALSE)
# oracle
oracle = mean(log(density.LGD(X = XTest, y = yTest.LGD)))
# relative improvement 
(relativeImprovement =   
  (predict.LGD.final$testLogLikelihood - predict.LGD.null$testLogLikelihood)/  
  (oracle - predict.LGD.null$testLogLikelihood))
#> [1] "relative improvement: 80.9%"

When the true density is unknown, the relative improvement can’t be computed. However, we can still obtain and use the absolute test log-likelihood as an assessment of LinCDE boost’s performance. Besides, the visualization tools below help evaluate fitted LinCDE boost models.

2.1.3 Visualization

For visualization, we plot the estimated conditional densities against the truth at the above \(9\) selected feature points. The estimated conditional densities are close to the truth.

# conditional density plot
densityPlot(X = XProfile, trueDensity = density.LGD, model = model.LGD.final, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

To identify the influential covariates, we plot the importance scores for each covariate. The importance score barplot indicates that the first two candidates contribute the most to the improvement in the objective, which agrees with the underlying density model.

summary(model.LGD.final, cBars = 8)

#>    var    rel.inf
#> 1   X2 36.3531030
#> 2   X1 29.9086800
#> 3  X20  8.4665390
#> 4  X19  6.8705365
#> 5  X14  5.4844255
#> 6  X10  3.8044124
#> 7  X18  3.5250123
#> 8   X4  2.0696343
#> 9  X13  1.6067923
#> 10 X12  1.1105122
#> 11  X9  0.8003525
#> 12  X3  0.0000000
#> 13  X5  0.0000000
#> 14  X6  0.0000000
#> 15  X7  0.0000000
#> 16  X8  0.0000000
#> 17 X11  0.0000000
#> 18 X15  0.0000000
#> 19 X16  0.0000000
#> 20 X17  0.0000000

2.2 More examples

The above example shows LinCDE boost’s ability to capture location and scale changes. We add two more examples focusing on modality and skewness (asymmetry), respectively.

For modality, we generate locally Gaussian mixture responses if \(X_2 \le 0.2\) and locally Gaussian responses if \(X_2 > 0.2\). Meanwhile, we let the responses’ locations depend on \(X_1\). We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.05, depth = 3, n.trees = 200. We use a larger df to learn the bi-modality. The rest of the parameters are at default values.

# data generation parameters
density.modality = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 0.25 * X[,1], sd = 0.8)
    index = which(X[,2] < 0.2)
    dens[index] = 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 0.25 * X[index, 1] - 0.8, sd = 0.3)
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1-groupIndex) * rnorm(n, 0.8, 0.3)
    y = 0.25 * X[,1] + (X[,2] <= 0.2) * y + (X[,2] > 0.2) * rnorm(n, 0, 0.8)
  }
}
set.seed(1)
y.modality = density.modality(X)
model.modality = LinCDE.boost(y = y.modality, X = X, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.05)

We compare the estimated conditional densities against the truth at the \(9\) selected feature points. The estimated conditional densities are clearly bi-modal for \(X_2 = -0.6\) and \(X_2 = 0\). For \(X_2 = 0.6\), the estimated conditional densities are largely Gaussian with mild curvatures in the middle. In summary, LinCDE boost identifies \(X_1\) and \(X_2\) as the most important covariates correctly.

# conditional density plot 
densityPlot(X = XProfile, trueDensity = density.modality, model = model.modality, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = "")) 

summary(model.modality, cBars = 8)

#>    var   rel.inf
#> 1   X2 15.784887
#> 2   X1 12.961287
#> 3  X19  6.936327
#> 4  X15  6.250048
#> 5   X5  5.846550
#> 6  X18  5.766242
#> 7  X16  5.035929
#> 8  X17  4.901791
#> 9   X8  4.522908
#> 10  X4  4.473423
#> 11  X3  4.199582
#> 12 X12  4.082083
#> 13  X9  3.440578
#> 14 X14  3.146878
#> 15 X13  2.813074
#> 16 X10  2.644095
#> 17 X20  2.161298
#> 18 X11  2.025661
#> 19  X6  1.692540
#> 20  X7  1.314820

For symmetry, we generate asymmetric Gaussian mixture responses. If \(X_2 > 0\), the right modal is sharper; if \(X_2 < 0\), the left modal is sharper. The larger the \(|X_2|\) is, the more asymmetric the conditional distribution is. Meanwhile, we let the responses’ locations depend on \(X_1\). We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.1, depth = 3, n.trees = 200, and df = 8. The rest parameters are at default values.

We compare the estimated conditional densities against the truth at \(9\) landmarks. The estimated conditional densities are right-skewed for \(X_2 = 0.6\), left-skewed for \(X_2 = -0.6\), and symmetric for \(X_2 = 0\). The \(X_1\) and \(X_2\) are correctly identified as important covariates.

# data generation parameters
density.skewness = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = 0.5 * dnorm(y, mean = 0.5 * X[, 1] + 0.8, sd = 0.5 + 0.45 * X[, 2]) +   
    0.5 * dnorm(y, mean = 0.5 * X[, 1] - 0.8, sd = 0.5 - 0.45 * X[, 2])
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, 0.8, 0.5 + 0.45 * X[, 2]) + (1 - groupIndex) * rnorm(n, -0.8, 0.5 - 0.45 * X[, 2]) + 0.5 * X[, 1]
  }
}

set.seed(1)
y.skewness = density.skewness(X)
model.skewness = LinCDE.boost(X = X, y = y.skewness, df = 8,
                                 depth = 3, n.trees = 200, shrinkage = 0.1)
# conditional density plot
densityPlot(X = XProfile, trueDensity = density.skewness, model = model.skewness, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

summary(model.skewness, cBars = 8)

#>    var    rel.inf
#> 1   X1 21.9539935
#> 2   X2 17.0168610
#> 3  X16  8.7661948
#> 4  X10  7.5742768
#> 5   X3  5.1747030
#> 6  X20  4.9460029
#> 7  X14  4.3826646
#> 8   X4  4.1571933
#> 9   X6  4.0269976
#> 10 X18  3.3791925
#> 11 X15  3.1927073
#> 12  X9  2.7585407
#> 13  X8  2.5577536
#> 14  X5  2.5184916
#> 15 X12  2.5166658
#> 16  X7  2.2210824
#> 17 X13  0.9773898
#> 18 X17  0.9410647
#> 19 X19  0.9382240
#> 20 X11  0.0000000

2.3 Centering

If a distribution’s conditional components differ violently in location, then the response discretization in Lindsey’s method could be problematic. In any local area, only a few bins are effective, and the rest see no observations. Such grouping is coarse, and there are no sufficient degrees of freedom to capture the distributional properties. We call this the “disjoint support” problem.

We propose to solve the “disjoint support” problem by centering. In particular, we suggest aligning the centers of the conditional densities in advance by estimating the conditional means first and subtracting the estimates from the responses. The residuals’ supports are less heterogeneous, and we apply LinCDE boost to the residuals to capture additional distributional structures.

In the following example, we generate locally Gaussian mixture responses if \(X_2 \le 0.2\), and locally Gaussian responses if \(X_2 > 0.2\). Meanwhile, we let the responses’ supports differ dramatically as \(X_1\) changes, and thus the “disjoint support” problem is present.

We follow the aforementioned workflow and set the hyper-parameters at shrinkage = 0.02, depth = 3, n.trees = 200 (the rest parameters are at default values). We compare LinCDE boost with and without centering. For centering, we use linear regression to estimate the conditional mean. In the relative influence plots, we stack importances from centering (in red) and beyond centering (in blue). \(X_1\) accounts for most of the importances in the centering, and \(X_2\) is the most important covariate beyond centering. Currently the decomposition of importances into centering and beyond centering is only available for centeringMethod = “linearRegression” or centeringMethod = “randomForest”.

We compute the estimated conditional densities against the truth at \(9\) landmarks. Without centering, LinCDE boost identifies the location shift but the predicted conditional densities are unimodal everywhere. With centering, LinCDE boost manages to produce the bi-modal structure for \(X_2 \le 0.2\).

# data generation parameters
density.centering = function(X, y = NULL){
  if(is.null(dim(X))){X = matrix(X, nrow = 1)}
  if(!is.null(y)){
    dens = dnorm(y, mean = 4 * X[, 1], sd = 0.8) 
    index = which(X[, 2] < 0.2)
    dens[index] = (0.5 * dnorm(y[index], mean = 4 * X[index, 1] + 0.8, sd = 0.3) + 0.5 * dnorm(y[index], mean = 4 * X[index, 1] - 0.8, sd = 0.3))
    return(dens)
  }else{
    n = dim(X)[1]
    groupIndex = rbinom(n, 1, 0.5)
    y = groupIndex * rnorm(n, -0.8, 0.3) + (1 - groupIndex) * rnorm(n, 0.8, 0.3)
    y = 4 * X[, 1] + (X[, 2] <= 0.2) * y + (X[, 2] > 0.2) * rnorm(n, 0, 0.8)
  }
}

set.seed(1)
y.centering = density.centering(X)
  
# without centering
model.centering.no = LinCDE.boost(X = X,  y = y.centering, depth = 2, n.trees = 200, shrinkage = 0.02, splineDf = 10)
# with centering
model.centering.OLS = LinCDE.boost(X = X, y = y.centering, df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, centering = TRUE, centeringMethod = "linearRegression")
# without centering
summary(model.centering.no, cBars = 8)

#>    var rel.inf
#> 1   X1     100
#> 2   X2       0
#> 3   X3       0
#> 4   X4       0
#> 5   X5       0
#> 6   X6       0
#> 7   X7       0
#> 8   X8       0
#> 9   X9       0
#> 10 X10       0
#> 11 X11       0
#> 12 X12       0
#> 13 X13       0
#> 14 X14       0
#> 15 X15       0
#> 16 X16       0
#> 17 X17       0
#> 18 X18       0
#> 19 X19       0
#> 20 X20       0
# without centering
densityPlot(X = XProfile, trueDensity = density.centering, model = model.centering.no) 

# with centering
summary(model.centering.OLS, cBars = 8)

#> $location
#>    var     rel.inf
#> 1   X1 48.67170894
#> 2   X2  1.00455822
#> 3   X4  1.25320551
#> 4  X15  1.35724808
#> 5   X8  0.78505622
#> 6  X19  1.01778142
#> 7   X5  0.11011730
#> 8   X7  0.70497903
#> 9  X18  0.97139414
#> 10 X12  0.26730585
#> 11  X3  0.53809259
#> 12 X17  0.61531201
#> 13  X6  0.43059357
#> 14 X11  0.35993610
#> 15 X20  0.28815834
#> 16  X9  0.20157656
#> 17 X14  0.17681822
#> 18 X16  0.10398081
#> 19 X13  0.09084538
#> 20 X10  0.05469914
#> 
#> $beyondLocation
#>    var    rel.inf
#> 1   X1  0.9966768
#> 2   X2 13.9411148
#> 3   X4  6.9998472
#> 4  X15  4.4420828
#> 5   X8  2.8556823
#> 6  X19  2.5691209
#> 7   X5  3.4664395
#> 8   X7  2.5975351
#> 9  X18  1.3300843
#> 10 X12  1.2039489
#> 11  X3  0.5941000
#> 12 X17  0.0000000
#> 13  X6  0.0000000
#> 14 X11  0.0000000
#> 15 X20  0.0000000
#> 16  X9  0.0000000
#> 17 X14  0.0000000
#> 18 X16  0.0000000
#> 19 X13  0.0000000
#> 20 X10  0.0000000
# with centering
densityPlot(X = XProfile, trueDensity = density.centering,  model = model.centering.OLS, factor.levels = paste("X1=", XProfile[,1], ", X2=", XProfile[,2],  sep = ""))

We remark that users can also input a function as the centeringMethod. We will use centeringMethod(y, X) to center the responses. In prediction, we will apply the predict function to the centering model. Below we define the centeringMethod to be the standard linear regression and feed it to LinCDE.boost. The LinCDE boost model model.centering.DIY should be the same as if we let centeringMethod = “linearRegression”, i.e., the model model.centering.OLS above.

meanMethod = function(y, X){
  meanModelData = as.data.frame(cbind(y, X))
  model = lm(y~., data = meanModelData)
}
model.centering.DIY = LinCDE.boost(X = X, y = y.centering, df = 8, depth = 2, n.trees = 200, shrinkage = 0.02, centering = TRUE, centeringMethod = meanMethod)

2.4 California housing example

We read in the California Housing data, remove windsorized responses, and divide the responses by \(1000\). For time efficiency, we randomly select \(1000\) observations as the training data.

cahousing=read.csv("./data/cahousing.csv", header = T) 
y.cahousing = cahousing[,1]; cahousing=cahousing[y.cahousing < max(y.cahousing),]
y.cahousing = cahousing[,1]/1000
X = data.matrix(cahousing[,-1])
set.seed(318); indexTrain = sample(length(y.cahousing), 1000, replace = FALSE)

We start from conditional estimation. We fit a simple gbm model on the training data. The most influential covariate to the conditional mean is median income (MedInc), followed by longtitude, lattitude, and average occupancy (AveOccup).

meanModel.cahousing = gbm::gbm.fit(y = y.cahousing[indexTrain], x = X[indexTrain, ], distribution = "gaussian", interaction.depth = 3, n.trees = 1000, shrinkage = 0.02, verbose = F)
summary(meanModel.cahousing)

#>                   var   rel.inf
#> MedInc         MedInc 46.974127
#> Longitude   Longitude 13.310592
#> AveOccup     AveOccup 12.341893
#> Latitude     Latitude 11.317929
#> HouseAge     HouseAge  5.119188
#> AveRooms     AveRooms  4.385236
#> AveBedrms   AveBedrms  3.288512
#> Population Population  3.262522

Next, we fit a simple LinCDE boost model with sufficient statistics \(y\),\(y^2\) as a start. Compared to the above conditional mean estimation, the LinCDE boost model also incorporates the variance information. The importances are similar to those from the gbm model: MedInc is the dominant covariate followed by AveOccup, Longitude, and Latitude. Let us plot the conditional density estimates for different (MedInc, AveOccup) pairs. Since we don’t know the underlying truth, we plot the histograms of the samples with similar covariate values as our reference. (The similarity of samples to a target point is measured by the covariate Mahalanobis distance weighted by the covariate importances from the LinCDE boost model.)

model.cahousing.Gaussian = LinCDE.boost(y = y.cahousing[indexTrain], X=X[indexTrain,], basis = "Gaussian", depth=3, n.trees=150, terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510)
summary(model.cahousing.Gaussian)

#>          var   rel.inf
#> 1     MedInc 30.488859
#> 2   AveOccup 16.912307
#> 3  Longitude 14.825653
#> 4   Latitude 11.179810
#> 5   HouseAge  8.989265
#> 6 Population  6.436688
#> 7   AveRooms  6.201678
#> 8  AveBedrms  4.965740
X.MedInc = quantile(X[,"MedInc"],c(.2,.5,.8)) # 20%, 50%, 80% quantiles of MedInc
X.AveOccup = quantile(X[,"AveOccup"],c(.2,.5,.8))  # 20%, 50%, 80% quantiles of AveOccup
XGrid=matrix(colMeans(X), byrow=TRUE, nrow=9, ncol=ncol(X)); colnames(XGrid) = colnames(X) # other covariates fixed at the sample means
XGrid[,c("MedInc", "AveOccup")] = data.matrix(expand.grid(X.MedInc, X.AveOccup))
yGrid = seq(0,510,length.out = 100) # response grid
plot.cahousing.Gaussian = densityPlot(X = XGrid, yGrid = yGrid, model=model.cahousing.Gaussian, plot = FALSE)

The estimated conditional densities seem to agree with the histograms in the dependencies of income and occupancy: the conditional mean increases as the income increases (from left to right), and the conditional variance decreases as the occupancy grows (from bottom to top) in the middle and right columns. Note that the histograms do not look Gaussian and let’s consider a more flexible basis for LinCDE boost.

Below we use \(10\) transformed natural cubic splines with \(6\) degrees of freedom.

model.cahousing.ns = LinCDE.boost(y=y.cahousing[indexTrain],X=X[indexTrain,], basis = "nsTransform", depth=3, splineDf = 10, df = 6, n.trees=300, terminalSize=50, shrinkage = 0.05, minY = 0, maxY = 510, verbose = TRUE)
summary(model.cahousing.ns)

#>          var   rel.inf
#> 1     MedInc 31.297311
#> 2  Longitude 14.588574
#> 3   AveOccup 14.468606
#> 4   Latitude 14.453976
#> 5   AveRooms 10.808165
#> 6   HouseAge  7.027820
#> 7  AveBedrms  4.317568
#> 8 Population  3.037978

The importances look similar. We make a plot of the estimated conditional densities at the same covariate configurations. As above, the conditional density’s location shifts to the right as the income increases, and the conditional density’s spread decreases as the occupancy grows in the middle and right columns. Different from above, the estimated conditional densities have diverse shapes beyond Gaussian. In local regions of \(2.32\) (\(20\%\) quantile) or \(3.45\) (\(50\%\) quantile) median incomes, the conditional distributions appear right-skewed. In regions of \(3.45\) (\(50\%\) quantile) median income and \(2.36\) (\(20\%\) quantile), \(2.84\) (\(50\%\) quantile) average occupancies, the conditional distributions have flatter peaks compared to the Gaussian distribution.

plot.cahousing.ns = densityPlot(X=XGrid, yGrid = yGrid, model = model.cahousing.ns, plot = FALSE) 

2.5 Plug in your own data and have fun with LinCDE!